ABSTRACT

Two recent simulation studies were conducted to aid in the diagnosis and
interpretation of equating differences found between random and matched
(nonrandom) samples for four commonly used equating procedures: (1) Tucker;
(2) Levine equally reliable; (3) Chained equipercentile observed-score; and
(4) three-parameter logistic, item response theory true-score equating. For
these simulations, test forms were equated to themselves, a situation that
does not pattern reality. In the current simulation, test variation was added
as an additional variable for study. The results of the current simulation
confirmed the results of the previous two simulations and support the
prediction based on theoretical grounds that observed-score equating methods,
such as Tucker and Chained equipercentile, are more affected by sample
variation than are a true-score method (IRT) or an observed-score method based
on true-score assumptions (Levine equally reliable). The results further
suggest that matching equating samples on the basis of a fallible measure of
ability, such as anchor test score, is not advisable for any equating method
studied except possibly the Tucker method. Three figures and one table present
simulation results, and an appendix contains two additional tables. (Contains
nine references.) (Author)
Abstract

Two recent simulation studies were conducted to aid in the diagnosis and
interpretation of equating differences found between random and matched
(nonrandom) samples for four commonly used equating procedures: Tucker,
Levine equally reliable, and Chained equipercentile observed-score procedures
and the 3PL IRT true-score equating procedure. For these simulations, test
forms were equated to themselves, a situation that does not pattern reality.
In the current simulation, test variation was added as an additional variable
for study. The results of the current simulation confirmed the results of the
previous two simulations and support the prediction based on theoretical
grounds that observed-score equating methods, such as Tucker and Chained
equipercentile, are more affected by sample variation than are a true-score
method (IRT) or an observed-score method based on true-score assumptions
(Levine equally reliable). The results further suggest that matching equating
samples on the basis of a fallible measure of ability, such as anchor test
score, is not advisable for any equating method studied except possibly the
Tucker method.
The Effects on Observed- and True-Score Equating Procedures
of Matching on a Fallible Criterion: 
A Simulation with Test Variation 

INTRODUCTION 

Recently, in an attempt to circumvent differences in common item or 
anchor test equating results across methods that are caused by samples that 
differ in ability, researchers at Educational Testing Service have begun to 
study the effects on the commonly used equating procedures of matching one of 
the equating samples being used to the other, through scores on the set of 
common items or anchor test. Lawrence and Dorans (1990) were the first to 
study the effects of matching, using anchor test scores on the Scholastic 
Aptitude Test (SAT), and a number of other studies of matching followed the
Lawrence and Dorans work. These studies have recently been published as a set
in an edition of Applied Measurement in Education (1990).

The work described in this paper may be viewed as an extension of the
real data SAT matching study conducted by Lawrence and Dorans (1990) and two
simulation studies involving matching with SAT data, one by Stocking, Eignor, 
and Cook (1988) and the other by Eignor, Stocking, and Cook (1990). Because 
of this, some of the details of the standard SAT data collection design will 
first be reviewed; then the results of the above-mentioned studies will be 
briefly discussed. For in-depth details about matching procedures, the reader 
should consult Dorans (1990). 

Figure 1 depicts the basic SAT equating data collection design, which
essentially represents an equating design linking the new form, labelled NEW,
to two old forms, OLD1 and OLD2. The specific old forms to be used in the
equating are established in the SAT braiding plan (Angoff, 1971); in general,
the populations taking forms NEW and OLD1 will be populations of similar
ability (data for form OLD1 will have been collected at a corresponding
administration (same time of year) in a year previous to that in which form
NEW was given), while the group of examinees taking form OLD2 will represent
either a more or less able candidate population (data for form OLD2 will have
been collected at a noncorresponding administration in a year previous to that
in which form NEW was given). Form NEW is linked to OLD1 via one anchor test
(EQ1) and to OLD2 via another anchor test (EQ2). These anchor equatings are
performed using representative (random) samples from the populations taking
the forms. Typically, the average of the anchor equatings to the two old
forms is taken as the operational conversion for the new form.



Insert Figure 1 about here 



In the Lawrence and Dorans (1990) study, the authors focused on the
equating of NEW to OLD2 and performed both conventional observed-score
(Tucker, Levine equally reliable, and Chained equipercentile; see Angoff,
1984, where Chained equipercentile is Design V) and three-parameter logistic
(3PL) item response theory (IRT) true-score (see Lord, 1980) equatings under
both representative (random) and matched (nonrandom) sampling conditions. In
the matched (nonrandom) condition, scores on EQ2 were used in an attempt to
match the ability level of the sample taking OLD2 to the ability level of the
sample taking NEW. That is, the distribution of anchor test scores was made
to be the same in the OLD2 and NEW samples, in the process altering the
characteristics of the OLD2 sample so that it was no longer a representative
(random) sample.

Consistency of equating results, and particularly scaled score means, 
across the representative and matched sample conditions was used as the 
criterion in the Lawrence and Dorans study. Lawrence and Dorans found that 
the means for Tucker equatings varied the least across the two sampling 
conditions and that the means for the Levine equally reliable, Chained
equipercentile, and 3PL IRT equating methods, while varying across the two
conditions, also tended to converge to the mean from the Tucker equating under
matched sampling conditions.

One potential problem with using consistency as the criterion is that 
consistent equating results may be different from the "true" equating results, 
were they known. In other words, the consistent Tucker equating results may 
have differed more from the "true" equating results in the Lawrence and Dorans 
study than the inconsistent Levine or IRT equatings. The lack of availability 
of "true" equating results suggested the need for a simulation study. 

Stocking, Eignor, and Cook (1988) developed a general simulation model
and then performed a sequence of simulations and subsequent equatings based on
that model that addressed a number of specific issues in the application of
both conventional (Tucker, Levine equally reliable, and Chained
equipercentile) and IRT-based (3PL true-score) equating methodologies, many of
which were brought out in the Lawrence and Dorans (1990) study. The purpose
of their study was to investigate the impact on the four equating procedures
just mentioned of: 1) differences in abilities of samples used for equating,
both when each examinee has complete data (an unrealistic setting) and when
each examinee has missing data (a more realistic setting); 2) subsequent
matching of samples on IRT ability, an infallible measure of ability (an
unrealistic setting); and 3) subsequent matching of samples on anchor test
observed score, a fallible measure of ability (a more realistic setting). To
be consistent with the Lawrence and Dorans study, the effect on scaled score
means of these various experimental conditions was chosen for study.

The results of the Stocking et al. study supported the prediction based on
theoretical grounds that observed-score equating methods, such as Tucker and
Chained equipercentile, are more affected by sample variation than are a
true-score equating method (3PL IRT) and an observed-score method based on
true-score assumptions (Levine equally reliable). Their results further
suggested that matching equating samples on the basis of a fallible measure of
ability is not advisable for any equating method studied other than Tucker.

The results of the Stocking et al. study, i.e., the scaled score means 
and standard deviations, were not completely inconsistent with the Lawrence 
and Dorans (1990) findings for SAT-Verbal in that the Stocking et al. results 
corresponded fairly closely to the results found by Lawrence and Dorans for 
one of the eight verbal forms they studied. However, the Stocking et al. 
results were fairly inconsistent with results for the other verbal forms 
studied by Lawrence and Dorans and quite inconsistent with the Lawrence and 
Dorans findings for SAT-Mathematical. The conclusions of the Stocking et al.
study were based on a single sequence of simulations, and because the results
differed a good deal from the Lawrence and Dorans real data results, a
replication of the Stocking et al. study was undertaken, using a different
SAT-Verbal form and completely new samples. In addition, the samples used for
the replication were based on an examinee group that was a good deal more able
(about a fifth of a standard deviation on the SAT scaled score metric) than
the examinee group used to define samples in the original study. The results
of the replication were reported in Eignor, Stocking, and Cook (1990). The
results of the replication phase essentially confirmed the results of the
Stocking et al. (1988) study and, collectively, the results from both studies
provided a reasonably strong basis for making recommendations about whether to 
match on a fallible criterion, such as anchor test score. 

However, in both of these simulation studies, the design called for 
variations in sample ability and the completeness of response data while 
controlling for test variation. Hence, tests were equated to themselves. 
While the results of the studies were seen by some as being informative, they 
do not pattern reality in equating the SAT, where a new form is equated to 
different old forms. 

The purpose of the present study was to introduce test variation into 
the simulation procedure, thereby providing an indication of the effects of 
test variation over and above the effects of variations in sample ability and 
completeness of response data on the anchor test matching process. Outside of 
the introduction of test variation (i.e., there were three distinct forms 
being used in the equating, rather than one), all other elements of this 
simulation completely paralleled the previous two simulations (Stocking et 
al., 1988; Eignor et al., 1990). Selected results from the previous two 
simulations will be presented in this paper so they may be contrasted to the 
results obtained with the introduction of test variation. 
THE STUDY DESIGN

The Definition of True Item and Person Parameters 

For the sequence of simulations performed, true item and person 
parameters were required. They could, of course, have been invented. It was 
more realistic, however, to use existing parameter estimates, but treat them 
as if they were true. It seems reasonable to assume that such a definition of 
truth captures at least some of the predominant features of actual data, such
as the spread of abilities and item difficulties. For this purpose, the
results of a LOGIST calibration (Wingersky, Barton, & Lord, 1982) of an
85-item SAT-Verbal test form (administered in two separately timed sections)
plus a 45-item anchor test or equating section were used as the true item
parameters for the new form (NEW) and equating section EQ1 (see Figure 1).
The results of another LOGIST calibration of the same 85-item form plus a
40-item anchor test section supplied the true item parameters for EQ2. The
results of a LOGIST calibration of another 85-item Verbal test form plus the
associated 45-item anchor test supplied the true item parameters for OLD1.
Finally, the results of a fourth LOGIST calibration of still another 85-item
Verbal test form plus the associated 40-item anchor test supplied the true
parameters for OLD2. All parameters were placed on a common scale using the
characteristic curve transformation method (Stocking and Lord, 1983), applied
to either the 45-item or 40-item anchor test from the separate calibrations.
Forms OLD1 and OLD2 were the actual Verbal forms to which NEW was equated at
its first operational administration.

True person parameters were defined to be the ability estimates obtained
from a random sample of 3004 real examinees drawn from the total group that 
took NEW and its associated equating section EQ2. This total group had an SAT 
scaled score mean of 441 and scaled score standard deviation of 108. 

Two population distributions of true ability were then defined. The 
first was defined to be exactly like the distribution of true person 
parameters, with mean true ability of -.02 and standard deviation of true
ability equal to 1.05. A second population was defined to be less able, with
mean true ability of -.35, but having the same standard deviation as the first
population (1.05).

A total of seven independent samples of size N = 3000 were drawn as
follows:

         Drawn from   Sample Mean   Sample S.D.
Sample   Population   of Ability    of Ability    To be Administered

  1          1           -.03          1.05       NEW + EQ1
  2          1            .00          1.07       NEW + EQ2
  3          1           -.05          1.06       OLD1 + EQ1
  4          1           -.03          1.05       OLD2 + EQ2
  5          2           -.34          1.07       OLD2 + EQ2
  6(a)       2           -.06          1.06       OLD2 + EQ2
  7(b)       2           -.05          1.04       OLD2 + EQ2

(a) Sample 6 was matched to sample 2 using the complete data observed
formula-score distribution of sample 2 on EQ2.
(b) Sample 7 was matched to sample 2 using the missing data observed
formula-score distribution of sample 2 on EQ2.

The Generation of Response Data

Two types of response data were generated for each simulated examinee
(simulee): complete data response strings and response strings reflecting
missing data. Complete data response strings were generated in the standard
fashion using the simulee's true ability and the item's true 3PL parameters to
generate the model predicted probability of a correct response, which was then
compared to a random number selected from a uniform [0,1] distribution (see
Lord, 1980).
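
The complete data generation step described above can be sketched in a few
lines. This is only an illustration under stated assumptions, not the code
used in the study: the function names, the item parameter values in the usage
lines, and the scaling constant D = 1.7 are assumptions introduced here.

```python
import math
import random

D = 1.7  # logistic scaling constant; assumed here, not stated in the paper


def p_correct_3pl(theta, a, b, c):
    """Model-predicted probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(theta, items, rng):
    """Generate one complete-data response string (1 = correct, 0 = incorrect).

    `items` is a list of (a, b, c) tuples treated as true item parameters.
    Each probability is compared to a uniform [0, 1) random draw.
    """
    responses = []
    for a, b, c in items:
        p = p_correct_3pl(theta, a, b, c)
        u = rng.random()
        responses.append(1 if u < p else 0)
    return responses


# Hypothetical item parameters, for illustration only.
example_items = [(1.0, 0.0, 0.2), (0.8, -1.0, 0.15), (1.2, 1.0, 0.25)]
example_string = simulate_responses(-0.02, example_items, random.Random(1))
```

Missing data response strings would require the speededness and omitting
models of Stocking et al. (1988), which are not reproduced here.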

The missing data response strings were generated from empirically-based 
models of speededness (for not reached items) and omitting behavior. With
these models, both the number of items reached and the number of items omitted
are functions of ability level. These models and the procedure for simulating
the two kinds of missing data are described in detail in Stocking et al.
(1988).

The Design of the Calibrations and Equatings 
The simulated responses from the seven samples of simulees to the test
forms and equating sections were combined into six separate concurrent LOGIST
runs, each representing an experimental condition. The design of each LOGIST
run was the same, and patterns the usual SAT data collection design presented 
in Figure 1: 

            NEW   EQ1   EQ2   OLD1   OLD2

Sample 1     X     X
Sample 2     X           X
Sample 3           X            X
Sample Y                 X             X

(Y = 4, 5, 6, or 7)

The data for all samples in a LOGIST run were either complete or contained
missing data. Sample 1 was administered the new form (NEW) and one anchor
test (EQ1); Sample 2 was administered the new form (NEW) and another anchor
test (EQ2); Sample 3 was administered the first anchor test (EQ1) and the
first old form (OLD1); and a final sample (either sample 4, 5, 6, or 7) was
administered the second anchor test (EQ2) and the other old form (OLD2). The
samples taking EQ2 and OLD2 in each LOGIST run, samples 4-7, were drawn in the
following fashion. Sample 4 was drawn randomly from the same population as
the other samples; Sample 5 was drawn randomly from the lower ability
population; Sample 6 was drawn from the lower ability population to match the
distribution of complete data observed formula-scores obtained by sample 2 on
EQ2; and Sample 7 was drawn from the lower ability population to match the
distribution of missing data observed formula-scores obtained by sample 2 on
EQ2.
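
The paper does not spell out the mechanics of drawing a matched sample. One
plausible reading, sketched below, is to select from a pool of lower ability
simulees so that the frequency of each anchor score equals its frequency in
the target sample. The function name and the toy scores are hypothetical, and
selection without replacement is an assumption.

```python
import random
from collections import Counter, defaultdict


def match_to_target(pool_scores, target_scores, rng):
    """Pick indices from the pool whose anchor scores reproduce the target's
    anchor score frequency distribution exactly (sampling without replacement).
    """
    by_score = defaultdict(list)
    for idx, score in enumerate(pool_scores):
        by_score[score].append(idx)

    chosen = []
    for score, needed in sorted(Counter(target_scores).items()):
        candidates = by_score.get(score, [])
        if len(candidates) < needed:
            raise ValueError(f"pool has too few simulees with score {score}")
        chosen.extend(rng.sample(candidates, needed))
    return chosen


# Toy anchor scores, for illustration only.
pool = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
target = [1, 1, 2, 3]
matched_idx = match_to_target(pool, target, random.Random(0))
matched_scores = sorted(pool[i] for i in matched_idx)
```

The matched sample agrees with the target on the observed anchor score only;
nothing in the procedure forces agreement on true ability.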

From the item parameter estimates derived from each of the LOGIST runs
or from the observed-score data for the samples used in the runs, the new form
was equated to each old form using the Tucker, Levine equally reliable,
Chained equipercentile, and 3PL IRT equating methods. The two equatings were
also averaged to produce a final equating. All old forms were placed on the
SAT 200 to 800 scaled score metric by the nonlinear equating originally
derived for each of the forms when they were given operationally for the first
time as new SAT forms. Projected scaled score means and standard deviations
were computed for each single equating and each average using samples of over
90,000 examinees who took NEW at its initial equating administration.
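
As a rough sketch of the last two steps, a projected mean and standard
deviation can be obtained by applying a raw-to-scale conversion to a raw
score frequency distribution. The sketch assumes that the two equatings are
averaged pointwise at each raw score, which the paper does not state
explicitly; the conversions and frequencies below are made-up toy values.

```python
import math


def average_conversions(conv_old1, conv_old2):
    """Average two raw-to-scale conversions pointwise (assumed procedure)."""
    return [(s1 + s2) / 2.0 for s1, s2 in zip(conv_old1, conv_old2)]


def projected_moments(conversion, freqs):
    """Mean and S.D. of scaled scores implied by a conversion and raw-score
    frequencies; conversion[i] is the scaled score for the i-th raw score.
    """
    n = sum(freqs)
    mean = sum(f * s for f, s in zip(freqs, conversion)) / n
    var = sum(f * (s - mean) ** 2 for f, s in zip(freqs, conversion)) / n
    return mean, math.sqrt(var)


# Toy values, not taken from the study.
conv_a = [200.0, 300.0, 400.0]
conv_b = [210.0, 310.0, 410.0]
raw_freqs = [1, 2, 1]
avg_conv = average_conversions(conv_a, conv_b)
toy_mean, toy_sd = projected_moments(conv_a, raw_freqs)
```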



The series of simulations were designed to study six experimental 
conditions, shown in the following table, which contains a letter for each 
experimental condition: 



The Experimental Conditions

                         True Ability Distribution

                 Equivalent    Unequal    Equivalent by Matching

Complete data        A            B                 C
Missing data         D            E                 F




Condition A, complete data and equivalent samples, is a benchmark
condition in that, while unlikely to be realized in practice, it represents
the best circumstances for any equating method. In addition, samples have
been chosen to be equivalent on the basis of an infallible criterion.
Condition B, complete data and unequal samples, provides for the exploration
of the effects of different sample abilities while still maintaining the ideal
situation of complete data for all simulees. Condition C, complete data and
matched samples, provides for the exploration of the effects of matching on a
fallible criterion while still maintaining the ideal situation of complete
data for all simulees. Condition D, missing data and equivalent samples, is a
more realistic condition in that samples now incorporate missing data. In
this condition, as in Condition A, samples have been chosen to be equivalent
on the basis of an infallible criterion. Condition E, missing data and
unequal samples, represents what is typically obtained in an SAT equating of
NEW to OLD2 in the absence of any further data manipulation. Condition F,
missing data and matched samples, represents the matching procedure employed
by Lawrence and Dorans (1990); that is, matching samples on the basis of a
fallible criterion in an attempt to achieve the ideal condition of equivalent
samples.
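
A short worked calculation suggests why matching on a fallible criterion need
not produce equivalent samples. It rests on assumptions that are not part of
the paper: the classical test theory regression of true score on observed
score is taken to be linear within a population, observed and true scores are
taken to share a scale, and the anchor reliability of 0.85 is hypothetical.
Only the two population means (-.02 and -.35) come from the study design.

```python
def expected_true_mean_after_matching(mu_target, mu_pool, reliability):
    """Expected true-ability mean of a sample drawn from the pool population
    and matched on observed score to a sample from the target population.

    Uses the linear regression of true on observed score within the pool:
    E[T | X = x] = mu_pool + reliability * (x - mu_pool).
    """
    return mu_pool + reliability * (mu_target - mu_pool)


mu_pop1 = -0.02             # more able population (from the study design)
mu_pop2 = -0.35             # less able population (from the study design)
assumed_reliability = 0.85  # hypothetical anchor test reliability

matched_mean = expected_true_mean_after_matching(
    mu_pop1, mu_pop2, assumed_reliability
)
remaining_gap = mu_pop1 - matched_mean
```

Under these assumptions the matched sample remains somewhat less able than
the target whenever reliability is below 1, which is the same direction as
the true ability means reported for the matched samples 6 and 7.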
RESULTS AND DISCUSSION

Table 1 shows the projected scaled score means and standard deviations 
for all individual equatings performed and for the averages. Tables A1 and A2
of the Appendix display comparable data from Stocking et al. (1988) and Eignor
et al. (1990). In Figure 2, plots of the projected scaled score means for the
individual equatings (not the averages) are displayed. Figure 3 contains
comparable plots of the results from the Stocking et al. (1988) and Eignor et
al. (1990) studies. In both Figures 2 and 3, the left side gives the results
of the equatings of NEW to OLD1, and the right side gives the results for the
equatings of NEW to OLD2. The experimental conditions are positioned along
the horizontal axis. The projected scaled score means are read from the
vertical axis. The points for a particular equating method are connected by
dashed or solid lines, identified in the legend for each method, for the
complete data cases and again for the missing data cases, to make the plots 
easier to read. 



Insert Table 1 and Figures 2 and 3 about here 

Table 1 and Figure 2 show that the differences among projected scaled 
score means are relatively small, although generally larger than the 
differences seen in Figure 3, where tests were equated to themselves. The 
importance of these differences among scaled score means is not possible to
judge, however, since approximate standard errors of equating have not been
developed for all methods (i.e., the IRT standard errors of equating have not
been developed to date).

To evaluate these results, it seems useful to compare the results of
each equating method across experimental conditions to its own value in the
"benchmark" condition. This condition, shown to the far left of each subplot,
is the one in which data are complete for each simulee and all samples of
simulees are drawn from the same ability distribution. In addition, this
condition, along with the comparable missing data condition (condition D),
represent "true" conditions in the sense that, in both cases, samples have
been matched on the basis of an infallible criterion.

New Form Equated to Old Form 1 

Conventional equating methods (Tucker, Levine equally reliable, and
Chained equipercentile) used for equating NEW to OLD1 are not affected by
different samples taking OLD2 since these samples do not enter into the
equating. Thus, the scaled score means for the conventional methods are
identical for conditions involving complete data (A, B, and C), and are also
identical to one another, though different from the complete data values, for
conditions involving missing data (D, E, and F). In contrast, since all test
forms are calibrated concurrently, 3PL IRT equating results vary slightly
across conditions in which the samples taking the other old form vary.

All equating methods are affected by missing responses in the response
strings for both the NEW and OLD1 samples (conditions D vs. A, E vs. B, and F
vs. C), although, for this simulation, Chained equipercentile equating
appears less affected than the other methods.
New Form Equated to Old Form 2

These equatings, shown in the right-hand subplot of Figure 2, are the
interesting ones; by design they are most affected by the experimental
conditions. As seen in Figure 2 and also in Table 1, the benchmark conditions
for all equating methods are different from the benchmark conditions for the
equating of NEW to OLD1. The Tucker benchmark conditions are most different
(over one and a half scaled score points); the Levine equally reliable
benchmark conditions are least different (less than a fifth of a scaled score
point). Differences for the Chained equipercentile and 3PL IRT benchmark
conditions are about the same.

The most striking aspect of these equatings, as was the case for the
equatings from Stocking et al. (1988) and Eignor et al. (1990) depicted in
Figure 3, is the sensitivity of observed-score equating methods to differences
in true sample ability. The introduction of samples of unequal ability,
whether in the complete data situation (condition B) or in the missing data
condition (condition E), has the largest impact on Tucker equating, and less
but substantial impact on Chained equipercentile equating. The remaining two
methods, Levine equally reliable and 3PL IRT, seem to be affected to about the
same degree.

As in the OLD1 equatings, the introduction of missing data (conditions D
vs. A, conditions E vs. B, and conditions F vs. C) also impacts the projected
means, making them slightly lower for all equating methods.

A particular hypothesis presented by Charles Lewis (personal
communication, October 21, 1987) for changes in 3PL IRT equating results
across missing data representative (random and unequal) sample conditions
(condition E) and matched sample conditions (condition F), and discussed in
Lawrence and Dorans (1990), is demonstrated by the decreases in the projected
scaled score means between conditions E and F. Tucker and Levine equally
reliable are identical, as they must be, under complete data and missing data
matched sample conditions (both models reduce to the direct nonanchor linear
equating method in which means and standard deviations are set equal for the
new form and old form samples; see Lawrence & Dorans, 1990), and the Chained
equipercentile equating is reasonably close to them.
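
The reduction of the Tucker method to direct linear equating under matching
can be checked numerically. The sketch below uses the standard textbook form
of the Tucker synthetic population formulas, which are not given in this
paper; the function names, the equal synthetic weights, and the toy scores
are assumptions made for illustration. The Levine method is not implemented.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pcov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def tucker_linear(x1, v1, y2, v2, w1=0.5):
    """Slope and intercept of the Tucker linear conversion of X to the Y scale.

    Group 1 took new form X and anchor V; group 2 took old form Y and anchor V.
    """
    w2 = 1.0 - w1
    g1 = pcov(x1, v1) / pvar(v1)    # regression slope of X on V in group 1
    g2 = pcov(y2, v2) / pvar(v2)    # regression slope of Y on V in group 2
    d_mu = mean(v1) - mean(v2)      # anchor mean difference
    d_var = pvar(v1) - pvar(v2)     # anchor variance difference
    mu_x = mean(x1) - w2 * g1 * d_mu
    mu_y = mean(y2) + w1 * g2 * d_mu
    var_x = pvar(x1) - w2 * g1 ** 2 * d_var + w1 * w2 * g1 ** 2 * d_mu ** 2
    var_y = pvar(y2) + w1 * g2 ** 2 * d_var + w1 * w2 * g2 ** 2 * d_mu ** 2
    slope = math.sqrt(var_y / var_x)
    return slope, mu_y - slope * mu_x


# Toy data in which both groups have an identical anchor score distribution,
# as they would after exact matching on the anchor.
anchor = [3, 5, 4, 6, 2, 7, 5, 4]
new_form = [10, 14, 12, 17, 8, 20, 15, 11]
old_form = [11, 16, 12, 18, 9, 19, 13, 12]

slope, intercept = tucker_linear(new_form, anchor, old_form, anchor)
direct_slope = math.sqrt(pvar(old_form) / pvar(new_form))
direct_intercept = mean(old_form) - direct_slope * mean(new_form)
```

With identical anchor means and variances, every anchor adjustment term is
zero, so the Tucker line coincides with the line that sets the means and
standard deviations of the two forms equal.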

If the benchmark condition (Condition A) is used as a criterion, it
seems clear that the 3PL IRT and Levine equally reliable equatings vary least
across all experimental conditions. If the Missing Data, Equivalent Samples
condition (D) is a more practical criterion, in other missing data conditions
(E and F), all equating methods except Tucker come closer to this criterion
when representative (i.e., random and unequal) samples are used than when
matched samples are used. The matching process appears to improve the Tucker
method slightly, while making the other methods much worse.
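
The comparison against condition D can be reproduced from the NEW to OLD2
means in Table 1. The sketch below covers only the Tucker and Levine equally
reliable rows; the Chained equipercentile and 3PL IRT values are omitted, and
the dictionary layout and function name are illustrative choices.

```python
# NEW to OLD2 projected scaled score means from Table 1, by condition letter.
means_new_to_old2 = {
    "Tucker": {
        "A": 439.45, "B": 433.67, "C": 434.12,
        "D": 437.94, "E": 432.54, "F": 433.50,
    },
    "Levine equally reliable": {
        "A": 440.05, "B": 438.44, "C": 434.12,
        "D": 438.43, "E": 436.73, "F": 433.50,
    },
}


def deviations(means, criterion, conditions):
    """Difference between each listed condition and the criterion condition."""
    return {c: round(means[c] - means[criterion], 2) for c in conditions}


tucker_dev = deviations(means_new_to_old2["Tucker"], "D", ["E", "F"])
levine_dev = deviations(
    means_new_to_old2["Levine equally reliable"], "D", ["E", "F"]
)
```

For Tucker, the matched condition F is closer to D than the unequal condition
E is; for Levine equally reliable, the order is reversed, and the matched
means in C and F coincide with the Tucker values.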

It is useful to compare the shapes of the plots of means for equating
NEW to OLD2 contained in Figures 2 and 3. Although these plots differ
somewhat for particular equating methods (e.g., compare the Tucker B to C
conditions for the replication to the comparable B to C conditions for the
original study and for the current study with test variation), in general
they are comparable in appearance and the conclusions that may be drawn from
all three are the same. In addition, while the introduction of test variation
seems to exacerbate slightly the differences in means across conditions for
the various equating methods when compared to the situation when a test is
equated to itself, this change in the differences was not as large as
anticipated. Based on the results of this single simulation with test
variation, it would appear that variations in sample ability and the
completeness of response data are greater contributors to differences in means
resulting from the various equating methods than are differences in the forms
being equated. This conclusion may be partly or wholly due to the fact,
however, that forms of the SAT are developed to tight content and statistical
specifications, and such results may not have been observed if the simulation
were done using data from a test where forms were not so parallel.

The results of this study are essentially the same as the results of the
previous studies by Stocking et al. (1988) and Eignor et al. (1990) and
suggest that if Levine equally reliable, Chained equipercentile, or 3PL IRT
equatings are to be used, more reasonable results are obtained using
representative (i.e., random and unequal) samples. If Tucker equating is to
be used and there is missing data, better results are obtained with matched
samples than with representative but unequal samples. However, if the
decision concerning the choice of equating procedure is to be made after the
sampling decision, then these results suggest that it is better to use the
representative sampling that typically occurs in SAT equating situations, and
to avoid selecting the Tucker method.

CONCLUSIONS 

As mentioned in the introduction, one criticism of the simulation 
studies on matching done by Stocking et al. (1988) and Eignor et al. (1990) is 
that their design called for variations in sample ability and the completeness 
of response data while controlling for test variation. Tests were equated to
themselves, which does not pattern reality in equating the SAT, where a new 
form is equated to different old forms. Hence, the results were seen by some 
as tenuous, because they were not reality-based. 

In the current study, test variation was introduced to pattern reality. 
The results of this study confirm the results of the previous two studies and, 
collectively, all three studies form a strong foundation for making 
recommendations about whether to match on a fallible criterion, anchor test
score. Only for Tucker equating are better results generally obtained when
samples of unequal ability are matched on this fallible criterion. 

Caveats presented in the conclusions sections of the previous studies 
are again relevant. The results of this and the previous studies should be 
examined from the viewpoint that response data in these simulations were 
generated according to the 3PL model, with some specific model violations 
introduced to incorporate missing data. These circumstances may favor the 3PL 
IRT equating results. Also, it is really not possible to draw definitive 
conclusions about the importance of the equating differences seen in these 
studies until estimates of standard errors of equating for all equating 
methods studied can be produced. However, the very similar patterns of
results across the three studies do allow one to conclude, even without the
standard errors, that matching samples on anchor test scores is not the best
way to proceed in dealing with equating samples of unequal ability.






References 



Angoff, W. H. (1971). The College Board Admissions Testing Program: A
technical report on research and development activities relating to the
Scholastic Aptitude Test and Achievement Tests. New York: College
Entrance Examination Board.

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ:
Educational Testing Service.

Dorans, N. J. (1990). The equating methods and sampling designs. Applied
Measurement in Education, 3, 3-17.

Eignor, D. R., Stocking, M. L., & Cook, L. L. (1990). Simulation results of
effects on linear and curvilinear observed- and true-score equating
procedures of matching on a fallible criterion. Applied Measurement in
Education, 3, 37-52.

Lawrence, I. M., & Dorans, N. J. (1990). The effect on equating results of
matching on an anchor test. Applied Measurement in Education, 3, 19-36.

Lord, F. M. (1980). Applications of item response theory to practical
testing problems. Hillsdale, NJ: Lawrence Erlbaum Assoc.

Stocking, M. L., Eignor, D. R., & Cook, L. L. (1988). Factors affecting the
sample invariant properties of linear and curvilinear observed- and
true-score equating procedures (RR-88-41). Princeton, NJ: Educational
Testing Service.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item
response theory. Applied Psychological Measurement, 7, 201-210.

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST V user's guide.
Princeton, NJ: Educational Testing Service.






Figure 1. Data collection design for equating the SAT

Notes: An X denotes the specific total test and anchor test taken by a
specific sample.

Samples 1 and 2 are representative samples from the same total group.

Sample 3 is a sample from a different total group that is similar in ability
to the total group from which Samples 1 and 2 were drawn.

Sample 4 is a sample from a different total group that is dissimilar in
ability to the total group from which Samples 1 and 2 were drawn.

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Table 1 

Projected Scaled Score Means and Standard Deviations for All
Equating Methods and All Experimental Conditions

- Test Variation -

                                                   NEW to OLD1       NEW to OLD2         Average
                                                  Mean     S.D.     Mean     S.D.     Mean     S.D.

Tucker
  Complete Data, Equivalent Samples (Benchmark)  441.14   110.74   439.45   108.02   440.14   110.73
  Complete Data, Unequal Samples                 441.14   110.74   433.67   104.77   437.24   107.52
  Complete Data, Matched Samples                 441.14   110.74   434.12   108.83   437.47   109.55
  Missing Data, Equivalent Samples               442.37   112.20   437.94   106.31   439.99   109.01
  Missing Data, Unequal Samples                  442.37   112.20   432.54   102.44   437.29   107.09
  Missing Data, Matched Samples                  442.37   112.20   433.50   105.65   437.77   108.69

Levine equally reliable
  Complete Data, Equivalent Samples (Benchmark)  440.22   110.73   440.05   109.03   440.08   110.79
  Complete Data, Unequal Samples                 440.22   110.73   438.44   104.49   439.25   107.48
  Complete Data, Matched Samples                 440.22   110.73   434.12   108.83   437.09   109.66
  Missing Data, Equivalent Samples               441.22   112.56   438.43   107.13   439.67   109.62
  Missing Data, Unequal Samples                  441.22   112.56   436.73   102.04   438.82   107.07
  Missing Data, Matched Samples                  441.22   112.56   433.50   105.65   437.20   108.87

Chained equipercentile
  Complete Data, Equivalent Samples (Benchmark)  440.48   109.96   439.93   108.79   440.17   109.29
  Complete Data, Unequal Samples                 440.48   109.96   436.91   106.48   438.66   108.13
  Complete Data, Matched Samples                 440.48   109.96   433.58   108.39   436.99   109.09
  Missing Data, Equivalent Samples               441.09   110.51   437.90   106.85   439.38   108.54
  Missing Data, Unequal Samples                  441.09   110.51   435.07   103.81   437.96   107.01
  Missing Data, Matched Samples                  441.09   110.51   432.94   104.99   436.90   107.61

IRT
  Complete Data, Equivalent Samples (Benchmark)  439.70   107.92   440.13   107.36   439.91   107.63
  Complete Data, Unequal Samples                 439.92   107.53   437.97   106.45   438.95   106.96
  Complete Data, Matched Samples                 439.87   107.77   434.71   106.86   437.29   107.29
  Missing Data, Equivalent Samples               440.90   107.11   438.45   105.87   439.68   106.46
  Missing Data, Unequal Samples                  441.16   106.74   436.94   104.93   439.05   105.79
  Missing Data, Matched Samples                  440.93   106.86   433.65   104.02   437.29   105.40
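
As a reading aid, the Average-column means of Table 1 can be expressed as deviations from each method's benchmark row (Complete Data, Equivalent Samples). The Python sketch below is not part of the original study. The names CONDITIONS, AVERAGE_MEANS, and deviations_from_benchmark are illustrative, and the Tucker value 437.29 assumes that the garbled source entry "4-7.29" has a tens digit of 3.

```python
# Average-column means from Table 1 (Test Variation), one list per method,
# ordered as in CONDITIONS. The first entry of each list is the benchmark.
CONDITIONS = [
    "Complete Data, Equivalent Samples (Benchmark)",
    "Complete Data, Unequal Samples",
    "Complete Data, Matched Samples",
    "Missing Data, Equivalent Samples",
    "Missing Data, Unequal Samples",
    "Missing Data, Matched Samples",
]

AVERAGE_MEANS = {
    # 437.29 is an assumed reading of the garbled source entry "4-7.29".
    "Tucker": [440.14, 437.24, 437.47, 439.99, 437.29, 437.77],
    "Levine equally reliable": [440.08, 439.25, 437.09, 439.67, 438.82, 437.20],
    "Chained equipercentile": [440.17, 438.66, 436.99, 439.38, 437.96, 436.90],
    "IRT": [439.91, 438.95, 437.29, 439.68, 439.05, 437.29],
}


def deviations_from_benchmark(means):
    """Return each condition's mean minus the benchmark (first) mean."""
    benchmark = means[0]
    return [round(mean - benchmark, 2) for mean in means]


DEVIATIONS = {
    method: deviations_from_benchmark(means)
    for method, means in AVERAGE_MEANS.items()
}

if __name__ == "__main__":
    for method, deviations in DEVIATIONS.items():
        print(method)
        for condition, deviation in zip(CONDITIONS, deviations):
            print(f"  {condition:<46} {deviation:+.2f}")
```

With these values, the matched-samples rows fall between about 2.4 and 3.3 scaled-score points below the benchmark for all four methods.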
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Table A1

Projected Scaled Score Means and Standard Deviations for All
Equating Methods and All Experimental Conditions

- Original Study -

                                                   NEW to OLD1       NEW to OLD2         Average
                                                  Mean     S.D.     Mean     S.D.     Mean     S.D.

Tucker
  Complete Data, Equivalent Samples (Benchmark)  420.72   112.39   421.22   108.52   420.96   110.44
  Complete Data, Unequal Samples                 420.72   112.39   414.90   106.31   417.80   109.34
  Complete Data, Matched Samples                 420.72   112.39   416.83   111.09   418.76   111.73
  Missing Data, Equivalent Samples               422.10   111.14   421.71   109.14   421.89   110.13
  Missing Data, Unequal Samples                  422.10   111.14   415.35   107.02   418.71   109.07
  Missing Data, Matched Samples                  422.10   111.14   417.95   108.92   420.02   110.02

Levine equally reliable
  Complete Data, Equivalent Samples (Benchmark)  420.89   112.30   420.79   107.55   420.83   109.91
  Complete Data, Unequal Samples                 420.89   112.30   420.06   106.97   420.47   109.62
  Complete Data, Matched Samples                 420.89   112.30   416.83   111.09   418.85   111.68
  Missing Data, Equivalent Samples               422.31   110.87   421.15   108.42   421.73   109.63
  Missing Data, Unequal Samples                  422.31   110.87   420.42   108.01   421.36   109.43
  Missing Data, Matched Samples                  422.31   110.87   417.95   108.92   420.13   109.88

Chained equipercentile
  Complete Data, Equivalent Samples (Benchmark)  420.74   112.77   420.82   107.85   420.81   110.24
  Complete Data, Unequal Samples                 420.74   112.77   418.76   107.39   419.78   110.00
  Complete Data, Matched Samples                 420.74   112.77   416.86   111.10   418.84   111.86
  Missing Data, Equivalent Samples               422.00   110.67   421.05   108.24   421.52   109.38
  Missing Data, Unequal Samples                  422.00   110.67   419.04   108.02   420.52   109.28
  Missing Data, Matched Samples                  422.00   110.67   417.82   108.93   419.90   109.72

IRT
  Complete Data, Equivalent Samples (Benchmark)  422.12   111.10   419.79   109.13   420.95   110.12
  Complete Data, Unequal Samples                 422.35   110.99   419.70   109.56   420.76   110.27
  Complete Data, Matched Samples                 422.34   111.18   417.11   110.84   419.73   111.01
  Missing Data, Equivalent Samples               422.52   110.37   420.46   108.94   421.49   109.65
  Missing Data, Unequal Samples                  422.77   110.17   420.12   109.90   421.45   110.04
  Missing Data, Matched Samples                  422.50   110.33   419.07   108.68   420.79   109.50
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Table A2

Projected Scaled Score Means and Standard Deviations for All
Equating Methods and All Experimental Conditions

- Replication -

                                                   NEW to OLD1       NEW to OLD2         Average
                                                  Mean     S.D.     Mean     S.D.     Mean     S.D.

Tucker
  Complete Data, Equivalent Samples (Benchmark)  439.05   108.92   440.52   106.29   439.78   107.60
  Complete Data, Unequal Samples                 439.05   108.92   435.20   105.67   437.12   107.29
  Complete Data, Matched Samples                 439.05   108.92   434.63   107.60   436.84   108.26
  Missing Data, Equivalent Samples               438.75   108.09   440.16   106.38   439.46   107.23
  Missing Data, Unequal Samples                  438.75   108.09   434.70   104.64   436.73   106.37
  Missing Data, Matched Samples                  438.75   108.09   435.12   107.11   436.94   107.60

Levine equally reliable
  Complete Data, Equivalent Samples (Benchmark)  439.02   109.21   441.07   106.03   440.04   107.62
  Complete Data, Unequal Samples                 439.02   109.21   440.67   106.23   439.84   107.71
  Complete Data, Matched Samples                 439.02   109.21   434.63   107.60   436.82   108.40
  Missing Data, Equivalent Samples               438.60   108.20   440.62   106.22   439.61   107.21
  Missing Data, Unequal Samples                  438.60   108.20   440.05   105.55   439.32   106.88
  Missing Data, Matched Samples                  438.60   108.20   435.12   107.11   436.86   107.66

Chained equipercentile
  Complete Data, Equivalent Samples (Benchmark)  439.20   109.20   440.02   106.16   440.03   107.57
  Complete Data, Unequal Samples                 439.20   109.20   439.35   106.16   439.25   107.5?
  Complete Data, Matched Samples                 439.20   109.20   434.63   107.43   436.89   108.22
  Missing Data, Equivalent Samples               438.04   107.08   440.61   106.45   439.07   106.40
  Missing Data, Unequal Samples                  438.04   107.08   438.86   105.79   438.21   106.11
  Missing Data, Matched Samples                  438.04   107.08   435.22   107.22   436.38   106.80

IRT
  Complete Data, Equivalent Samples (Benchmark)  439.26   108.54   440.12   107.36   439.69   107.95
  Complete Data, Unequal Samples                 438.89   108.31   440.78   108.20   439.83   108.26
  Complete Data, Matched Samples                 438.95   108.52   435.44   107.73   437.20   108.12
  Missing Data, Equivalent Samples               439.58   108.06   439.77   106.91   439.67   107.49
  Missing Data, Unequal Samples                  439.20   107.85   440.82   107.80   440.01   107.82
  Missing Data, Matched Samples                  439.24   108.03   435.26   107.64   437.25   107.84
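
One pattern visible in the tabled values is that, under matched samples, the Tucker and Levine equally reliable methods report the same NEW to OLD2 mean and standard deviation in Table 1, Table A1, and Table A2. The Python sketch below checks this. It is an illustration only, not part of the original study: the names MATCHED_NEW_TO_OLD2 and methods_agree are hypothetical, and the Table A2 values assume that the OCR character "S" stands for the digit 5 (for example, "43S.12" read as 435.12).

```python
# NEW to OLD2 (mean, S.D.) pairs under matched samples, as read from the
# three tables. Keys "complete" and "missing" refer to the data condition.
MATCHED_NEW_TO_OLD2 = {
    "Table 1": {
        "complete": {"Tucker": (434.12, 108.83), "Levine": (434.12, 108.83)},
        "missing": {"Tucker": (433.50, 105.65), "Levine": (433.50, 105.65)},
    },
    "Table A1": {
        "complete": {"Tucker": (416.83, 111.09), "Levine": (416.83, 111.09)},
        "missing": {"Tucker": (417.95, 108.92), "Levine": (417.95, 108.92)},
    },
    "Table A2": {
        # Assumed readings of OCR-damaged entries such as "43S.12".
        "complete": {"Tucker": (434.63, 107.60), "Levine": (434.63, 107.60)},
        "missing": {"Tucker": (435.12, 107.11), "Levine": (435.12, 107.11)},
    },
}


def methods_agree(entry):
    """True when Tucker and Levine report the same (mean, S.D.) pair."""
    return entry["Tucker"] == entry["Levine"]


ALL_AGREE = all(
    methods_agree(entry)
    for table in MATCHED_NEW_TO_OLD2.values()
    for entry in table.values()
)

if __name__ == "__main__":
    for table_name, table in MATCHED_NEW_TO_OLD2.items():
        for data_condition, entry in table.items():
            print(table_name, data_condition, methods_agree(entry))
```

The same check can be used to catch transcription slips when the tables are re-keyed from the scanned pages.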
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